
    Approximate Near Neighbors for General Symmetric Norms

    We show that every symmetric normed space admits an efficient nearest neighbor search data structure with doubly-logarithmic approximation. Specifically, for every n, every d = n^{o(1)}, and every d-dimensional symmetric norm ‖·‖, there exists a data structure for poly(log log n)-approximate nearest neighbor search over ‖·‖ for n-point datasets achieving n^{o(1)} query time and n^{1+o(1)} space. The main technical ingredient of the algorithm is a low-distortion embedding of a symmetric norm into a low-dimensional iterated product of top-k norms. We also show that our techniques cannot be extended to general norms. Comment: 27 pages, 1 figure
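    The top-k norm that the embedding targets is simple to state: the sum of the k largest absolute coordinates. A minimal sketch (function name and example vector are our own, not from the paper):

```python
import numpy as np

def top_k_norm(x, k):
    """Top-k norm: the sum of the k largest absolute coordinates.
    It is symmetric (invariant under coordinate permutations and sign
    flips); k = 1 gives the l_inf norm and k = d gives the l_1 norm."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]  # descending |x_i|
    return float(a[:k].sum())

x = [3.0, -1.0, 4.0, 1.0, -5.0]
print(top_k_norm(x, 1))  # 5.0  (l_inf)
print(top_k_norm(x, 5))  # 14.0 (l_1)
```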

    Distance-Sensitive Hashing

    Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measures. For example, this is true for output-sensitive similarity search/join, and for indexes supporting annulus queries that aim to report a point close to a certain given distance from the query point. In this paper we initiate the study of distance-sensitive hashing (DSH), a generalization of LSH that seeks a family of hash functions such that the probability of two points having the same hash value is a given function of the distance between them. More precisely, given a distance space (X, dist) and a "collision probability function" (CPF) f: ℝ → [0,1] we seek a distribution over pairs of functions (h, g) such that for every pair of points x, y ∈ X the collision probability is Pr[h(x) = g(y)] = f(dist(x, y)). Locality-sensitive hashing is the study of how fast a CPF can decrease as the distance grows. For many spaces, f can be made exponentially decreasing even if we restrict attention to the symmetric case where g = h. We show that the asymmetry achieved by having a pair of functions makes it possible to achieve CPFs that are, for example, increasing or unimodal, and show how this leads to principled solutions to problems not addressed by the LSH framework. This includes a novel application to privacy-preserving distance estimation. We believe that the DSH framework will find further applications in high-dimensional data management. Comment: Accepted at PODS'18. Abstract shortened due to character limit
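    As a toy illustration of why asymmetry helps, consider d-bit vectors under Hamming distance: let h read a random coordinate and let g read the same coordinate flipped. The resulting CPF is increasing in the distance, which no symmetric family with g = h can achieve. This construction is our own example, not one from the paper:

```python
import random

def dsh_pair(d, rng=random):
    """Sample an asymmetric DSH pair (h, g) for d-bit 0/1 vectors:
    h reads a uniformly random coordinate i, g reads the same i flipped."""
    i = rng.randrange(d)
    return (lambda x: x[i]), (lambda y: 1 - y[i])

def exact_cpf(x, y):
    """Pr[h(x) == g(y)] over the random coordinate: a collision happens
    iff x[i] != y[i], so the CPF is Hamming(x, y) / d -- increasing."""
    return sum(xi != yi for xi, yi in zip(x, y)) / len(x)

print(exact_cpf([0, 1, 1, 0], [0, 0, 1, 1]))  # 0.5 (Hamming distance 2, d = 4)
```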

    Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search

    The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a general technique for constructing a data structure to answer approximate near neighbor queries by using a distribution H over locality-sensitive hash functions that partition space. For a collection of n points, after preprocessing, the query time is dominated by O(n^ρ log n) evaluations of hash functions from H and O(n^ρ) hash table lookups and distance computations, where ρ ∈ (0,1) is determined by the locality-sensitivity properties of H. It follows from a recent result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive hash functions can be reduced to O(log^2 n), leaving the query time to be dominated by O(n^ρ) distance computations and O(n^ρ log n) additional word-RAM operations. We state this result as a general framework and provide a simpler analysis showing that the number of lookups and distance computations closely matches the Indyk-Motwani framework, making it a viable replacement in practice. Using ideas from another locality-sensitive hashing framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of additional word-RAM operations to O(n^ρ). Comment: 15 pages, 3 figures
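    The framework can be illustrated with the classic bit-sampling LSH for Hamming space: L hash tables, each keyed by the concatenation of K sampled bit positions. The class and parameter names below are our own illustrative choices, not the paper's:

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    """Toy Indyk-Motwani-style index for Hamming space: L tables,
    each keyed by K randomly sampled bit positions."""

    def __init__(self, dim, K, L, seed=0):
        rng = random.Random(seed)
        # One K-tuple of sampled coordinates per table.
        self.projections = [tuple(rng.randrange(dim) for _ in range(K))
                            for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, x, t):
        return tuple(x[i] for i in self.projections[t])

    def insert(self, idx, x):
        for t in range(len(self.tables)):
            self.tables[t][self._key(x, t)].append(idx)

    def query(self, q):
        # Union of colliding buckets; the caller then computes true
        # distances to these O(n^rho) candidates.
        cands = set()
        for t in range(len(self.tables)):
            cands.update(self.tables[t].get(self._key(q, t), []))
        return cands

index = BitSamplingLSH(dim=8, K=4, L=10)
index.insert(0, (0, 1, 1, 0, 1, 0, 0, 1))
index.insert(1, (1, 1, 1, 1, 1, 1, 1, 1))
print(0 in index.query((0, 1, 1, 0, 1, 0, 0, 1)))  # True: identical points always collide
```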

    Taylor Polynomial Estimator for Estimating Frequency Moments

    We present a randomized algorithm for estimating the p-th moment F_p of the frequency vector of a data stream in the general update (turnstile) model to within a multiplicative factor of 1 ± ε, for p > 2, with high constant confidence. For 0 < ε ≤ 1, the algorithm uses space O(n^{1-2/p} ε^{-2} + n^{1-2/p} ε^{-4/p} log n) words. This improves over the current bound of O(n^{1-2/p} ε^{-2-4/p} log n) words by Andoni et al. in \cite{ako:arxiv10}. Our space upper bound matches the lower bound of Li and Woodruff \cite{liwood:random13} for ε = (log n)^{-Ω(1)} and the lower bound of Andoni et al. \cite{anpw:icalp13} for ε = Ω(1). Comment: Supersedes arXiv:1104.4552. Extended abstract of this paper to appear in Proceedings of ICALP 201
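    For reference, the quantity being approximated: F_p is defined on the frequency vector accumulated from turnstile updates (item, Δ) as the sum of |f_i|^p. A direct linear-space computation, with made-up example updates:

```python
from collections import defaultdict

def frequency_moment(updates, p):
    """Exact p-th frequency moment F_p = sum_i |f_i|^p of the frequency
    vector built from turnstile updates (item, delta). The streaming
    algorithm approximates this value in sublinear space."""
    f = defaultdict(int)
    for item, delta in updates:
        f[item] += delta  # deltas may be negative in the turnstile model
    return sum(abs(v) ** p for v in f.values())

updates = [("a", 3), ("b", 2), ("a", -1), ("c", 1)]
print(frequency_moment(updates, 3))  # |2|^3 + |2|^3 + |1|^3 = 17
```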

    On the segmentation and classification of hand radiographs

    This research is part of a wider project to build predictive models of bone age using hand radiograph images. We examine ways of finding the outline of a hand from an X-ray as the first stage in segmenting the image into constituent bones. We assess a variety of algorithms including contouring, which has not previously been used in this context. We introduce a novel ensemble algorithm for combining outlines using two voting schemes, a likelihood ratio test and dynamic time warping (DTW). Our goal is to minimize the human intervention required, hence we investigate alternative ways of training a classifier to determine whether an outline is in fact correct or not. We evaluate outlining and classification on a set of 1370 images. We conclude that ensembling with DTW improves performance of all outlining algorithms, that the contouring algorithm used with the DTW ensemble performs the best of those assessed, and that the most effective classifier of hand outlines assessed is a random forest applied to outlines transformed into principal components.
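    The DTW used for ensembling aligns two outline sequences by a dynamic program over all monotone warpings. A standard textbook implementation for 1-D sequences (not the authors' code):

```python
def dtw(a, b):
    """Dynamic time warping distance between two 1-D sequences, as used
    to align and combine candidate hand outlines. O(len(a)*len(b))."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: match, insertion, deletion.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0 -- the warp absorbs the repeated 2
```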

    Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.
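    The exact baseline the paper speeds up is brute-force k-NN under an arbitrary similarity function. A sketch using cosine similarity (our illustrative choice; the paper uses similarities that capture term associations):

```python
import heapq
import math

def knn(query, records, sim, k):
    """Exact brute-force k-NN: score every record with the similarity
    function and keep the top k. An approximate index answers the same
    query much faster by scoring only a fraction of the records."""
    return heapq.nlargest(k, records, key=lambda r: sim(query, r))

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(knn([1.0, 0.0], docs, cosine, k=2))  # [[1.0, 0.0], [1.0, 1.0]]
```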

    Development of a Low-Cost Optical Sensor to Detect Eutrophication in Irrigation Reservoirs

    [EN] In irrigation ponds, an excess of nutrients can cause eutrophication, a massive growth of microscopic algae. This can cause various problems in the irrigation infrastructure and should be monitored. In this paper, we present a low-cost sensor based on optical absorption for determining the concentration of algae in irrigation ponds. The sensor is composed of 5 LEDs with different wavelengths and light-dependent resistors as photoreceptors. Data were gathered for the calibration of the prototype using two turbidity sources, sediment and algae, in both pure and mixed samples. Samples were measured at concentrations ranging from 15 mg/L to 4000 mg/L. Multiple regression models and artificial neural networks, with a training and a validation phase, are compared as two alternative methods to classify the tested samples. Our results indicate that using multiple regression models, it is possible to estimate the concentration of algae with an average absolute error of 32.0 mg/L and an average relative error of 11.0%. On the other hand, it is possible to classify up to 100% of the samples in the validation phase with the artificial neural network. Thus, a novel prototype capable of distinguishing turbidity sources and two classification methodologies, which can be adapted to different node features, are proposed for the operation of the developed prototype.
    This work is partially funded by the Ministerio de Educación, Cultura y Deporte through the "Ayudas para contratación pre-doctoral de Formación del Profesorado Universitario FPU (Convocatoria 2016)" grant number FPU16/05540 and by the Conselleria de Educación, Cultura y Deporte through the "Subvenciones para la contratación de personal investigador en fase postdoctoral", grant number APOSTD/2019/04.
    Rocher-Morant, J.; Parra-Boronat, L.; Jimenez, J. M.; Lloret, J.; Basterrechea-Chertudi, D. A. (2021). Development of a Low-Cost Optical Sensor to Detect Eutrophication in Irrigation Reservoirs. Sensors 21(22):1-20. https://doi.org/10.3390/s21227637
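    The multiple-regression calibration can be sketched as ordinary least squares mapping the five LED responses to a concentration. The response values and concentrations below are invented placeholders, not the paper's measurements:

```python
import numpy as np

# Hypothetical calibration data: each row holds the sensor response at
# the five LED wavelengths; y is the algae concentration in mg/L.
X = np.array([[0.90, 0.85, 0.70, 0.60, 0.50],
              [0.72, 0.60, 0.55, 0.47, 0.40],
              [0.51, 0.43, 0.38, 0.31, 0.22],
              [0.30, 0.24, 0.21, 0.17, 0.12]])
y = np.array([100.0, 500.0, 1500.0, 3000.0])

# Ordinary least squares with an intercept column appended.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(sample):
    """Estimate concentration (mg/L) from a 5-LED response vector."""
    return float(np.append(sample, 1.0) @ coef)

print(round(predict(X[0]), 1))
```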

    The prevalence of axial spondyloarthritis in the UK: a cross-sectional cohort study

    Background: Accurate prevalence data are important when interpreting diagnostic tests and planning for the health needs of a population, yet no such data exist for axial spondyloarthritis (axSpA) in the UK. In this cross-sectional cohort study we aimed to estimate the prevalence of axSpA in a UK primary care population.
    Methods: A validated self-completed questionnaire was used to screen primary care patients with low back pain for inflammatory back pain (IBP). Patients with a verifiable pre-existing diagnosis of axSpA were included as positive cases. All other patients meeting the Assessment of SpondyloArthritis international Society (ASAS) IBP criteria were invited to undergo further assessment, including MRI scanning, allowing classification according to the European Spondyloarthropathy Study Group (ESSG) and ASAS axSpA criteria, and the modified New York (mNY) criteria for ankylosing spondylitis (AS).
    Results: Of 978 questionnaires sent to potential participants, 505 were returned (response rate 51.6%). Six subjects had a prior diagnosis of axSpA, 4 of whom met mNY criteria. Thirty-eight of 75 subjects meeting ASAS IBP criteria attended review (mean age 53.5 years, 37% male). The number of subjects satisfying classification criteria was 23 for ESSG, 3 for ASAS (2 clinical, 1 radiological) and 1 for the mNY criteria. This equates to a prevalence of 5.3% (95% CI 4.0, 6.8) using ESSG, 1.3% (95% CI 0.8, 2.3) using ASAS, and 0.66% (95% CI 0.28, 1.3) using mNY criteria in chronic back pain patients, and 1.2% (95% CI 0.9, 1.4) using ESSG, 0.3% (95% CI 0.13, 0.48) using ASAS, and 0.15% (95% CI 0.02, 0.27) using mNY criteria in the general adult primary care population.
    Conclusions: These are the first prevalence estimates for axSpA in the UK, and will be of importance in planning for the future healthcare needs of this population.
    Trial registration: Current Controlled Trials ISRCTN7687321
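    Interval estimates of this kind come from a confidence interval for a binomial proportion. The Wilson score interval below is our illustrative choice of method (the study does not state which interval it used), and the counts in the example are likewise only illustrative:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a proportion of k cases among n
    subjects; z = 1.96 gives an approximate 95% interval."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Illustrative only: 3 criteria-positive subjects among 505 responders.
lo, hi = wilson_ci(3, 505)
print(f"{100 * lo:.2f}% - {100 * hi:.2f}%")
```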

    Hardness of Approximate Nearest Neighbor Search

    We prove conditional near-quadratic running time lower bounds for approximate Bichromatic Closest Pair with Euclidean, Manhattan, Hamming, or edit distance. Specifically, unless the Strong Exponential Time Hypothesis (SETH) is false, for every δ > 0 there exists a constant ε > 0 such that computing a (1+ε)-approximation to the Bichromatic Closest Pair requires n^{2-δ} time. In particular, this implies a near-linear lower bound on the query time for Approximate Nearest Neighbor search, even with polynomial preprocessing time. Our reduction uses the Distributed PCP framework of [ARW'17], but obtains improved efficiency using Algebraic Geometry (AG) codes. Efficient PCPs from AG codes have been constructed in other settings before [BKKMS'16, BCGRS'17], but our construction is the first to yield new hardness results.
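    For concreteness, the problem whose quadratic barrier is being established: given red points A and blue points B, find the closest cross pair. The naive exhaustive scan below is the baseline that, under SETH, cannot be beaten by a polynomial factor even approximately (the example points are ours):

```python
import math

def bichromatic_closest_pair(A, B):
    """Exhaustive O(|A|*|B|) scan over all red/blue pairs under
    Euclidean distance; returns (distance, red point, blue point)."""
    best = (math.inf, None, None)
    for a in A:
        for b in B:
            d = math.dist(a, b)  # Euclidean distance (Python 3.8+)
            if d < best[0]:
                best = (d, a, b)
    return best

A = [(0.0, 0.0), (5.0, 5.0)]
B = [(1.0, 0.0), (9.0, 9.0)]
print(bichromatic_closest_pair(A, B))  # (1.0, (0.0, 0.0), (1.0, 0.0))
```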